Skip to content

Fix some threading issues (some free-threading specific)#2162

Draft
seberg wants to merge 4 commits into
NVIDIA:mainfrom
seberg:threading-fixes
Draft

Fix some threading issues (some free-threading specific)#2162
seberg wants to merge 4 commits into
NVIDIA:mainfrom
seberg:threading-fixes

Conversation

@seberg

@seberg seberg commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

This fixes a few threading issues, but we may want to discuss some details still.

  • The GraphNode cleanup order is an important fix: Another thread may end up with the same pointer (but new object) as soon as we clean it up. So we have to remove it from the cache before cleaning it up.
  • Use of atomics: I think this is needed, but for this one place an atomic seemed more reasonable. (However, hard to test and if it can fail IIUC only on ARM.)
    (A critical section would work as well, since it is a lock which enforces order.)
  • The critical sections should be pretty safe. I am not sure they will all ensure that the object is always the identity but I am pretty sure it protects from worse races.
    pytest-run-parallel did find this for MemPool.attributes and I could not reproduce after adding it. The other critical sections are added with a code-search, though.
  • The split mutex: This is thread-unsafe and failed threaded testing. But I am honestly not sure if that isn't just expected, or whether the mutex is good but it should also be safe from within CUDA?
  • Use of setdefault cached pattern is largely just normalizing. Without the return dict.setdefault a different instance may be returned on different threads (or a cache entry replaced). For the cyGraphMemoryResource that triggered a test with pytest-run-parallel although that doesn't mean it is problematic as such. cuda-pathfinder uses functools.cache, but usually for strings; the one we may want to look at is load_nvidia_dynamic_lib.

Checklist

Implementing gh-2114 would add implicit coverage for most but the atomics and some of the critical sections (if it adds it might be so rare to effectively never trigger). So this is split out.
(If preferred I can add some explicitly threaded tests.)

I have no full runs with pytest-run-parallel, but I want to clean up the test harness a bit more. I also still should test without free-threading (useful to flush out forgotten with nogil: blocks that can cause deadlocks).

@copy-pr-bot

copy-pr-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Jun 2, 2026
if _can_use_structured_sm_split():
return _split_with_general_api(self, opts, dry_run)
# SplitByCount requires the same 12.4+ as green ctx support (already checked above)
return _split_with_count_api(self, opts, dry_run)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointing this out inline, because this is probably the thing I am least sure about. Should SMResource.split() be thread-safe if called on the same resource from multiple threads?

@seberg seberg force-pushed the threading-fixes branch from 304e64e to 3a7b314 Compare June 2, 2026 16:34
@seberg seberg self-assigned this Jun 2, 2026
@seberg seberg added this to the cuda.core next milestone Jun 2, 2026
@seberg seberg added the bug Something isn't working label Jun 2, 2026
@leofang leofang added the P0 High priority - Must do! label Jun 5, 2026
@leofang leofang requested review from Andy-Jost and leofang June 8, 2026 01:36
@mdboom

mdboom commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

This should probably be a separate PR, but are there best practices for CI to help ensure these sorts of issues don't come back? Should we be running pytest-run-parallel on our 3.14t configurations, for example?

@seberg

seberg commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Should we be running pytest-run-parallel on our 3.14t configurations, for example?

Yeah, I just thought I'd split out this PR first and get it in and then open a PR for that. Running pytest-run-parallel is +maybe not a huge PR but also a non-trivial PR because we probably want a "mini pytest plugin" for init_cuda style logic.

(ideally we use pytest-run-parallel together with a TSAN environment, I haven't actually done TSAN yet locally. Since we have ASAN jobs, I think that may well be very doable.)

seberg added 4 commits June 11, 2026 15:09
This fixes a few threading issues, but we may want to discuss some
details still.
* The GraphNode cleanup order is an important fix. Another thread may
  end up with the same pointer (but new object) as soon as we clean it
  up.  So we have to remove it from the cache before cleaning it up.
* Use of atomics: I think this is needed, but for this one place
  an atomic seemed more reasonable.  (However, hard to test and if
  it can fail IIUC only on ARM.)
* The critical sections should be pretty safe.  I am not sure they
  will all ensure that the object is always the _identity_ but I am
  pretty sure it protects from worse races.
  (Testing did find this for MemPool.attributes, not others yet.
  Testing with thread-sanitizer might flush out some...)
* The split mutex: This is thread-unsafe.  But I am honestly not
  sure if that isn't just expected, or whether the mutex is good
  but it should also be safe from within CUDA.
* Use of `setdefault` cached pattern is largely just normalizing.  Without
  the `return dict.setdefault` a different instance may be returned on
  different threads (or a cache entry replaced).
  For the `cyGraphMemoryResource` that triggered a test with pytest-run-parallel
  although that doesn't mean it is problematic as such.
  `cuda-pathfinder` uses functools.cache, but usually for strings;
  the one we may want to look at is `load_nvidia_dynamic_lib`.

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
…s good...

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
@seberg seberg force-pushed the threading-fixes branch from 04c3862 to f5b239c Compare June 11, 2026 13:10
@rparolin

Copy link
Copy Markdown
Collaborator

It would be great if at least the critical section and atomics changes were broken out into a separate PR.

@seberg

seberg commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

OK, I split out most things and will update after that (or split out the final chunks), until then marking as draft.

What is remaining is the GraphNode cleanup order and the mutex for SMResource.split().

@seberg seberg marked this pull request as draft June 15, 2026 10:19

@Andy-Jost Andy-Jost left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just commenting (not approving) because this PR is marked draft (I think it will be split).

I'm fine with these fixes as they are, though I'd also like to explore more structural fixes to decrease the likelihood problems are reintroduced. None of that blocks this change for me. Thematically, I think it amounts to moving more things into the C++ layer. Some ideas below from my bot (where Level 1 and Level 2 refer to the registry levels documented in REGISTRY_DESIGN.md):

There are more structural options, and they're largely aligned with where the handle layer was already heading. This PR is tactical patching; the strategic fix is centralize lifetime + identity + lazy state in C++ boxes/factories, and make Cython thin.

  1. Push lazy-init into Boxes (replaces atomics/critical sections for CUDA metadata)
    A lot of what got @cython.critical_section / atomics is really "query driver once per handle": Buffer mem attrs, pool attrs, kernel attrs. That belongs in the box next to the resource, with std::once_flag or a box mutex:
  struct DevicePtrBox {
      CUdeviceptr resource;
      mutable MemAttrs attrs;
      mutable std::once_flag attrs_init;
  };

Cython properties become return box->get_attrs() — no Python-side check-then-act. This matches the existing handle design: state lives with ownership, not with the Python wrapper.

  1. Unified C++ factory + destroy protocol (fixes L1 ordering in one place) create_graph_node_handle already owns Level 1. Extend it to own the full lifecycle:
  create_graph_node(CUgraphNode, GraphHandle) → GraphNodeHandle + Python wrapper
  destroy_graph_node(GraphNodeHandle) → { pop L2, unregister L1, null resource, cuGraphDestroyNode }

All under one C++ mutex. Then Cython GraphNode.destroy() is a one-liner. The invalidate-then-destroy rule becomes impossible to get wrong at call sites.

  1. Templatized Level-2 / as_py identity cache in C++
    REGISTRY_DESIGN.md already cites pybind11's registered_instances. A generic:
  template<typename Handle>
  class PyIdentityRegistry {
      // key: owner identity (owner_before / box pointer)
      // value: PyObject* via weakref
  };

could serve GraphNode, Kernel, Event, etc. Key on handle ownership, not raw CU*. Registration happens in the factory; unregistration in the destroy protocol. Cython stops maintaining _node_registry by hand.

as_py(h) in DESIGN.md was meant to return Python wrappers — today GraphNode's as_py still goes to cuda.bindings. Evolving it to return the cached cuda.core object would collapse L2 into the handle layer.

  1. What you still can't fully move out of Cython
    • Polymorphic construction: GN_create_impl dispatches to EmptyNode, KernelNode, … C++ either needs a typed factory table or a Cython callback registered at init — still one entry point, but not zero Cython.
    • Pure-Python lazy caches: IPCDataForBuffer, DLPack/StridedMemoryView inference, ComputeCapability tuples — these aren't handle metadata; they'll need Python locks or redesign.
    • APIs that aren't handle-backed: SMResource.split() — mutex or a documented "caller serializes" contract unless the driver guarantees more.

@seberg

seberg commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Not sure I am following all of it, but C++ threading does seem much more mature than Python/Cython so pushing more to C++ is probably a good idea.
Some notes/thoughts:

  • In most cases here the critical_section is on functions that return Python objects (one where it wasn't uses the atomic, I forgot to delete it), so there are two considertions:
    1. Creating the object is quick but not free performance wise.
    2. It may be nice if things are idempotent which a C++ side re-wrapping can't solve?
      (But: I don't think the @critical_section doesn't guarantee idempotence fully.)
      I am slightly wondering if we couldn't create a tiny pattern ourselves for this but not sure how nice it'll be my best brainstorming is to create a C++ helper (mirroring dicts):
      cdef atomic_object:
          cdef get(..., order=something):
          cdef set(..., order=something):
          cdef object setdefault(..., order=something):
      
      but if we go down that route it may make more sense to talk to Cython folks. I think Cython support will improve.
  • Moving the as_py adds the need to deal with cleaning up weakrefs, but I suspect it is a good idea even just to not split the logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.core Everything related to the cuda.core module P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants